The aim of this study is to replicate the findings of a published paper that aimed to identify a gene signature that could distinguish Mycobacterium tuberculosis(TB) and Human immunodeficiency virus (HIV) co-infected patients from patients with HIV only.
Methodology of the published paper was followed beginning with initial data preprocessing followed by finding a 251 gene expression signature using support vector machine recursive feature elimination(SVM-RFE) then validating the classifier using the 6 test sets reportedly used. Receiving operating characteristic curves were plotted and the sensitivity, specificity, area under the curve(AUC) and accuracy values were found. A heatmap on the training data was created and principal component analysis was conducted on the training data and Test Set 1.
It was found that reproducibility of the results in the published paper was met with great difficulty. This was due to missing information given in regards to the methodology used in the published paper as well as an incorrect study design and methodology in addressing the aim. ### Introduction
Human immunodeficiency virus (HIV) is an enveloped retrovirus that attacks the immune system (1). Mycobacterium tuberculosis (TB) is an infectious disease with a high mortality, second only to Covid-19 (2). 10 million individuals contracted TB and 1.5 million people died from TB in 2020 with 214,000 also having HIV (2). TB only requires a few particles for infection (3). Once infected, it usually becomes latent in most individuals, but there is a 5-10% risk of TB becoming active (3).
A factor that can increase the risk of TB becoming activated is immunosuppression (3). People with HIV are 18x more likely to develop active TB (3). HIV causes depletion and exhaustion of immune cells like CD4+ t cells which are required for the immune system to fight TB (3). Thus having a co-infection of HIV/TB causes both disease progressions to accelerate (3). TB has become the leading cause of death in individuals with HIV (3).
TB is curable with the right treatment regime however, TB must first be detected (2). 45% of HIV-negative individuals will die without proper treatment but for HIV-positive individuals, nearly all will die without treatment (2), hence why detecting TB in an individual with HIV is of upmost importance.Current diagnostic tools have reduced sensitivity in detecting TB in HIV-positive individuals which can lead to false negatives (4). When combining two tests, the LAM and Xpert MTB/RIF, together, only 80% of symptomatic individuals are detected (4). Some studies have explored gene expression in HIV/TB individuals to identify possible biomarkers however currently, no gene signatures have been identified that can identify HIV/TB co-infection (4).
During 2014, a paper titled ‘Identification of a 251 Gene Expression Signature That Can Accurately Detect M. tuberculosis in Patients with and without HIV Co-Infection’ was published which reportedly found a 251 Gene Expression Signature that could distinguish HIV patients from HIV patients co-infected with TB. Critique of this paper is necessary as published findings can be used for further application and research, thus, it must be ensured that the results and conclusions drawn are accurate. Here an attempt to replicate the findings will be conducted and a judgement on the appropriateness of the method and analysis will be made.
Replication and analysis were based on 43 study subjects recruited from the Themba Lethu Clinic in Johannesburg South Africa between the 6th of September 2007 to the 16th of October 2008. The 43 subjects consisted of 22 patients that were mono-infected with HIV and 21 patients that were co-infected with HIV/TB.
Raw gene expression microarray data was reportedly quantile normalized. Retrieval of the data revealed that normalized data was provided and that access to the pre-normalized data was not provided, therefore, quantile normalization was unable to be replicated and subsequent analysis was based on the normalized data given. Figure 1 displays boxplots indicating quantile normalization on the gene expression data.
Origin <- getGEO("GSE50834", GSEMatrix = TRUE)[[1]]
Pheno = pData(Origin)
Genes = fData(Origin)
Expression <- exprs(Origin)
rnames = row.names(Expression)
cnames = colnames(Expression)
Expression %>%
as.data.frame() %>%
filter_all(all_vars(. < 1000)) %>%
boxplot(las =2, cex.axis = 0.5)
Figure 1: Boxplot of Gene Expression of Samples
There existed one patient with two sample IDs of which were GSM1230903 and GSM1230904. Prior to further analysis, the gene expression data was averaged for this patient.
Expression_cut <- Expression %>%
as.data.frame() %>%
mutate(GSM1230946=(GSM1230903 + GSM1230904)/2) %>%
dplyr::select(-GSM1230903) %>%
dplyr::select(-GSM1230904) %>%
relocate(GSM1230946,.after = GSM1230902) %>%
dplyr::rename(GSM1230903=GSM1230946)
row.names(Expression_cut) = rnames
colnames(Expression_cut) = cnames[-3]
Pheno_cut <- Pheno %>% filter(title != "HIV infected patient 2_2")
It was reported that non-informative probes that had a maximum fold change less than 1.2 between any two samples were removed due to showing little variation or being expressed at the background level. Upon replication, fold change was calculated for all genes between any two samples and there were no genes found to have a fold change less than 1.2.
#calculating fold change between samples for all genes
data = Expression_cut
Keep = c()
for (row in 1:nrow(data)) {
changes_forward = c()
names_forward = c()
changes_backward = c()
names_backward = c()
for (a in 1:(ncol(data)-1)){
col1 = a
col2 = a
while(col2 <= ncol(data)-1) {
col2 = col2 + 1
fc_forward = data[row,col1]/data[row,col2]
changes_forward = c(changes_forward, fc_forward)
fc_backward = data[row,col2]/data[row,col1]
changes_backward = c(changes_backward, fc_backward)
}
}
foldchange = data.frame(change_f = changes_forward,
change_b = changes_backward)
foldchange$max <- apply(foldchange, MARGIN = 1, FUN = max)
if (max(foldchange$max) <1.2) {
truth = FALSE
}
else {
truth = TRUE
}
Keep = c(Keep, truth)
}
table(Keep)
## Keep
## TRUE
## 15529
Multiple iterations of support vector machine(SVM) were run on the full dataset where 10% of the bottom weights were removed in each iteration until 251 genes were returned. A score was assigned by the SVM classifier to each sample with a positive score indicating a prediction of TB and a negative score indicating a prediction of no active TB. A heatmap of the gene expression was produced to visualise whether the genes could be clustered into groups.
TB associated datasets were reportedly used to validate the reported study, hence these datasets were also used upon replication. The associated test sets (TS1,TS3,TS4,TS5) were normalized through replication of the reported steps. Normalization consisted of having the mean expression per probe in each dataset separately calculated and the overall mean calculated. Within a sample, each probe’s expression level was adjusted by this difference. TS6 was normalized through median quantile normalization followed by adjustment of each sample’s expression by the difference between the mean overall expression and the mean across TS1,TS3,TS4,TS5 as reported.
The dataset stemming from GSE19435 was split into TS1 and TS2 of which TS1 consisted of 12 healthy patients and 7 active TB patients on 2 months of treatment and TS2 consisted of 7 active TB patients on no treatment and 7 active TB patients on 12 months of treatment. TS3 consisted of 12 healthy patients, 17 latent TB patients and 12 active TB patients on no treatment. There were 12 healthy patients in TS4, 21 latent TB patients and 21 active TB patients on no treatment in TS4, 31 latent TB patients and 20 active TB patients on no treatment in TS5, and 38 latent TB patients and 29 active TB patients on no treatment in TS6.
TS1 <- getGEO("GSE19435", GSEMatrix = TRUE)[[1]]
TS3 <- getGEO("GSE19439", GSEMatrix = TRUE)[[1]]
TS4<- getGEO("GSE19444", GSEMatrix = TRUE)[[1]]
TS5 <- getGEO("GSE19442", GSEMatrix = TRUE)[[1]]
TS6 <- getGEO("GSE40553", GSEMatrix = TRUE)[[1]]
#Sample(patient) data
Pheno_TS1 = pData(TS1)
#Gene data
Genes_TS1 = fData(TS1)
#gene expression data
Expression_TS1 <- exprs(TS1)
rnames_TS1 = row.names(Expression_TS1)
cnames_TS1 = colnames(Expression_TS1)
TS1_mean.df <- as.data.frame(Expression_TS1) %>%
rowwise() %>%
mutate(TS1_mean = mean(c_across(1:33)))
Expression_TS1_gene <- as.data.frame(Expression_TS1) %>% tibble::rownames_to_column("Gene")
E_TS1_gene <- Expression_TS1_gene %>% dplyr::select(Gene)
TS1_mean_gene <- cbind(E_TS1_gene, TS1_mean.df)
#Sample(patient) data
Pheno_TS3 = pData(TS3)
#Gene data
Genes_TS3 = fData(TS3)
#gene expression data
Expression_TS3 <- exprs(TS3)
rnames_TS3 = row.names(Expression_TS3)
cnames_TS3 = colnames(Expression_TS3)
TS3_mean.df <- as.data.frame(Expression_TS3) %>%
rowwise() %>%
mutate(TS3_mean = mean(c_across(1:42)))
Expression_TS3_gene <- as.data.frame(Expression_TS3) %>% tibble::rownames_to_column("Gene")
E_TS3_gene <- Expression_TS3_gene %>% dplyr::select(Gene)
TS3_mean_gene <- cbind(E_TS3_gene, TS3_mean.df)
#Sample(patient) data
Pheno_TS4 = pData(TS4)
#Gene data
Genes_TS4 = fData(TS4)
#gene expression data
Expression_TS4 <- exprs(TS4)
rnames_TS4 = row.names(Expression_TS4)
cnames_TS4 = colnames(Expression_TS4)
TS4_mean.df <- as.data.frame(Expression_TS4) %>%
rowwise() %>%
mutate(TS4_mean = mean(c_across(1:54)))
Expression_TS4_gene <- as.data.frame(Expression_TS4) %>% tibble::rownames_to_column("Gene")
E_TS4_gene <- Expression_TS4_gene %>% dplyr::select(Gene)
TS4_mean_gene <- cbind(E_TS4_gene, TS4_mean.df)
#Sample(patient) data
Pheno_TS5 = pData(TS5)
#Gene data
Genes_TS5 = fData(TS5)
#gene expression data
Expression_TS5 <- exprs(TS5)
rnames_TS5 = row.names(Expression_TS5)
cnames_TS5 = colnames(Expression_TS5)
TS5_mean.df <- as.data.frame(Expression_TS5) %>%
rowwise() %>%
mutate(TS5_mean = mean(c_across(1:51)))
Expression_TS5_gene <- as.data.frame(Expression_TS5) %>% tibble::rownames_to_column("Gene")
E_TS5_gene <- Expression_TS5_gene %>% dplyr::select(Gene)
TS5_mean_gene <- cbind(E_TS5_gene, TS5_mean.df)
#Sample(patient) data
Pheno_TS6 = pData(TS6)
#Gene data
Genes_TS6 = fData(TS6)
#gene expression data
Expression_TS6 <- exprs(TS6)
rnames_TS6 = row.names(Expression_TS6)
cnames_TS6 = colnames(Expression_TS6)
TS6_quantile_normalized <- as.data.frame(normalize.quantiles(exprs(TS6)))
#Expression_TS6 %>%
# as.data.frame() %>%
# filter_all(all_vars(. < 1000)) %>%
# boxplot()
#TS6_quantile_normalized %>%
#as.data.frame() %>%
# filter_all(all_vars(. < 1000)) %>%
# boxplot()
row.names(TS6_quantile_normalized) = rnames_TS6
colnames(TS6_quantile_normalized) = cnames_TS6
TS6_mean.df <- TS6_quantile_normalized %>% rowwise() %>% dplyr::mutate(TS6_mean = mean(c_across(1:204)))
Expression_TS6_gene <- as.data.frame(Expression_TS6) %>% tibble::rownames_to_column("Gene")
E_TS6_gene <- Expression_TS6_gene %>% dplyr::select(Gene)
TS6_mean_gene <- cbind(E_TS6_gene, TS6_mean.df)
TS1_5 <- TS1_mean_gene %>% full_join(TS3_mean_gene) %>%
full_join(TS3_mean_gene) %>%
full_join(TS4_mean_gene) %>%
full_join(TS5_mean_gene)
# average and difference across all samples from the 4 datasets (TS1,3,4,5)
TS1_5_average_and_diff <- TS1_5 %>% dplyr::select(Gene, TS1_mean, TS3_mean, TS4_mean, TS5_mean) %>%
rowwise %>%
mutate(TS1_3_4_5_mean = mean(c_across(2:5))) %>%
mutate(TS1_diff = TS1_3_4_5_mean - TS1_mean) %>%
mutate(TS3_diff = TS1_3_4_5_mean - TS3_mean) %>%
mutate(TS4_diff = TS1_3_4_5_mean - TS4_mean) %>%
mutate(TS5_diff = TS1_3_4_5_mean - TS5_mean)
#TS1 normalized
TS1_normalized<- TS1_mean_gene %>% left_join(TS1_5_average_and_diff) %>%
dplyr::select(-TS3_mean, -TS4_mean, -TS5_mean, -TS3_diff, -TS4_diff, -TS5_diff, -TS1_mean, -TS1_3_4_5_mean) %>%
rowwise() %>%
mutate(GSM484368 = GSM484368 + TS1_diff) %>%
mutate(GSM484369 = GSM484369 + TS1_diff) %>%
mutate(GSM484370 = GSM484370 + TS1_diff) %>%
mutate(GSM484371 = GSM484371 + TS1_diff) %>%
mutate(GSM484372 = GSM484372 + TS1_diff) %>%
mutate(GSM484373 = GSM484373 + TS1_diff) %>%
mutate(GSM484374 = GSM484374 + TS1_diff) %>%
mutate(GSM484375 = GSM484375 + TS1_diff) %>%
mutate(GSM484376 = GSM484376 + TS1_diff) %>%
mutate(GSM484377 = GSM484377 + TS1_diff) %>%
mutate(GSM484378 = GSM484378 + TS1_diff) %>%
mutate(GSM484379 = GSM484379 + TS1_diff) %>%
mutate(GSM484380 = GSM484380 + TS1_diff) %>%
mutate(GSM484381 = GSM484381 + TS1_diff) %>%
mutate(GSM484382 = GSM484382 + TS1_diff) %>%
mutate(GSM484383 = GSM484383 + TS1_diff) %>%
mutate(GSM484384 = GSM484384 + TS1_diff) %>%
mutate(GSM484385 = GSM484385 + TS1_diff) %>%
mutate(GSM484386 = GSM484386 + TS1_diff) %>%
mutate(GSM484387 = GSM484387 + TS1_diff) %>%
mutate(GSM484388 = GSM484388 + TS1_diff) %>%
mutate(GSM484389 = GSM484389 + TS1_diff) %>%
mutate(GSM484390 = GSM484390 + TS1_diff) %>%
mutate(GSM484391 = GSM484391 + TS1_diff) %>%
mutate(GSM484392 = GSM484392 + TS1_diff) %>%
mutate(GSM484393 = GSM484393 + TS1_diff) %>%
mutate(GSM484394 = GSM484394 + TS1_diff) %>%
mutate(GSM484395 = GSM484395 + TS1_diff) %>%
mutate(GSM484396 = GSM484396 + TS1_diff) %>%
mutate(GSM484397 = GSM484397 + TS1_diff) %>%
mutate(GSM484398 = GSM484398 + TS1_diff) %>%
mutate(GSM484399 = GSM484399 + TS1_diff) %>%
mutate(GSM484400 = GSM484400 + TS1_diff) %>%
dplyr::select(-TS1_diff)
#TS3 normalized
TS3_normalized<- TS3_mean_gene %>% left_join(TS1_5_average_and_diff) %>%
dplyr::select(-TS1_mean, -TS4_mean, -TS5_mean, -TS1_diff, -TS4_diff, -TS5_diff, -TS3_mean, -TS1_3_4_5_mean) %>%
rowwise() %>%
mutate(GSM484448 = GSM484448 + TS3_diff) %>%
mutate(GSM484449 = GSM484449 + TS3_diff) %>%
mutate(GSM484450 = GSM484450 + TS3_diff) %>%
mutate(GSM484451 = GSM484451 + TS3_diff) %>%
mutate(GSM484452 = GSM484452 + TS3_diff) %>%
mutate(GSM484453 = GSM484453 + TS3_diff) %>%
mutate(GSM484454 = GSM484454 + TS3_diff) %>%
mutate(GSM484455 = GSM484455 + TS3_diff) %>%
mutate(GSM484456 = GSM484456 + TS3_diff) %>%
mutate(GSM484457 = GSM484457 + TS3_diff) %>%
mutate(GSM484458 = GSM484458 + TS3_diff) %>%
mutate(GSM484459 = GSM484459 + TS3_diff) %>%
mutate(GSM484460 = GSM484460 + TS3_diff) %>%
mutate(GSM484461 = GSM484461 + TS3_diff) %>%
mutate(GSM484462 = GSM484462 + TS3_diff) %>%
mutate(GSM484463 = GSM484463 + TS3_diff) %>%
mutate(GSM484464 = GSM484464 + TS3_diff) %>%
mutate(GSM484465 = GSM484465 + TS3_diff) %>%
mutate(GSM484466 = GSM484466 + TS3_diff) %>%
mutate(GSM484467 = GSM484467 + TS3_diff) %>%
mutate(GSM484468 = GSM484468 + TS3_diff) %>%
mutate(GSM484469 = GSM484469 + TS3_diff) %>%
mutate(GSM484470 = GSM484470 + TS3_diff) %>%
mutate(GSM484471 = GSM484471 + TS3_diff) %>%
mutate(GSM484472 = GSM484472 + TS3_diff) %>%
mutate(GSM484473 = GSM484473 + TS3_diff) %>%
mutate(GSM484474 = GSM484474 + TS3_diff) %>%
mutate(GSM484475 = GSM484475 + TS3_diff) %>%
mutate(GSM484476 = GSM484476 + TS3_diff) %>%
mutate(GSM484477 = GSM484477 + TS3_diff) %>%
mutate(GSM484478 = GSM484478 + TS3_diff) %>%
mutate(GSM484479 = GSM484479 + TS3_diff) %>%
mutate(GSM484480 = GSM484480 + TS3_diff) %>%
mutate(GSM484481 = GSM484481 + TS3_diff) %>%
mutate(GSM484482 = GSM484482 + TS3_diff) %>%
mutate(GSM484483 = GSM484483 + TS3_diff) %>%
mutate(GSM484484 = GSM484484 + TS3_diff) %>%
mutate(GSM484485 = GSM484485 + TS3_diff) %>%
mutate(GSM484486 = GSM484486 + TS3_diff) %>%
mutate(GSM484487 = GSM484487 + TS3_diff) %>%
mutate(GSM484488 = GSM484488 + TS3_diff) %>%
mutate(GSM484489 = GSM484489 + TS3_diff) %>%
dplyr::select(-TS3_diff)
#TS4 normalized
TS4_normalized<- TS4_mean_gene %>% left_join(TS1_5_average_and_diff) %>%
dplyr::select(-TS1_mean, -TS3_mean, -TS5_mean, -TS1_diff, -TS3_diff, -TS5_diff, -TS4_mean, -TS1_3_4_5_mean) %>%
rowwise() %>%
mutate(GSM484595 = GSM484595 + TS4_diff) %>%
mutate(GSM484596 = GSM484596 + TS4_diff) %>%
mutate(GSM484597 = GSM484597 + TS4_diff) %>%
mutate(GSM484598 = GSM484598 + TS4_diff) %>%
mutate(GSM484599 = GSM484599 + TS4_diff) %>%
mutate(GSM484600 = GSM484600 + TS4_diff) %>%
mutate(GSM484601 = GSM484601 + TS4_diff) %>%
mutate(GSM484602 = GSM484602 + TS4_diff) %>%
mutate(GSM484603 = GSM484603 + TS4_diff) %>%
mutate(GSM484604 = GSM484604 + TS4_diff) %>%
mutate(GSM484605 = GSM484605 + TS4_diff) %>%
mutate(GSM484606 = GSM484606 + TS4_diff) %>%
mutate(GSM484607 = GSM484607 + TS4_diff) %>%
mutate(GSM484608 = GSM484608 + TS4_diff) %>%
mutate(GSM484609 = GSM484609 + TS4_diff) %>%
mutate(GSM484610 = GSM484610 + TS4_diff) %>%
mutate(GSM484611 = GSM484611 + TS4_diff) %>%
mutate(GSM484612 = GSM484612 + TS4_diff) %>%
mutate(GSM484613 = GSM484613 + TS4_diff) %>%
mutate(GSM484614 = GSM484614 + TS4_diff) %>%
mutate(GSM484615 = GSM484615 + TS4_diff) %>%
mutate(GSM484616 = GSM484616 + TS4_diff) %>%
mutate(GSM484617 = GSM484617 + TS4_diff) %>%
mutate(GSM484618 = GSM484618 + TS4_diff) %>%
mutate(GSM484619 = GSM484619 + TS4_diff) %>%
mutate(GSM484620 = GSM484620 + TS4_diff) %>%
mutate(GSM484621 = GSM484621 + TS4_diff) %>%
mutate(GSM484622 = GSM484622 + TS4_diff) %>%
mutate(GSM484623 = GSM484623 + TS4_diff) %>%
mutate(GSM484624 = GSM484624 + TS4_diff) %>%
mutate(GSM484625 = GSM484625 + TS4_diff) %>%
mutate(GSM484626 = GSM484626 + TS4_diff) %>%
mutate(GSM484627 = GSM484627 + TS4_diff) %>%
mutate(GSM484628 = GSM484628 + TS4_diff) %>%
mutate(GSM484629 = GSM484629 + TS4_diff) %>%
mutate(GSM484630 = GSM484630 + TS4_diff) %>%
mutate(GSM484631 = GSM484631 + TS4_diff) %>%
mutate(GSM484632 = GSM484632 + TS4_diff) %>%
mutate(GSM484633 = GSM484633 + TS4_diff) %>%
mutate(GSM484634 = GSM484634 + TS4_diff) %>%
mutate(GSM484635 = GSM484635 + TS4_diff) %>%
mutate(GSM484636 = GSM484636 + TS4_diff) %>%
mutate(GSM484637 = GSM484637 + TS4_diff) %>%
mutate(GSM484638 = GSM484638 + TS4_diff) %>%
mutate(GSM484639 = GSM484639 + TS4_diff) %>%
mutate(GSM484640 = GSM484640 + TS4_diff) %>%
mutate(GSM484641 = GSM484641 + TS4_diff) %>%
mutate(GSM484642 = GSM484642 + TS4_diff) %>%
mutate(GSM484643 = GSM484643 + TS4_diff) %>%
mutate(GSM484644 = GSM484644 + TS4_diff) %>%
mutate(GSM484645 = GSM484645 + TS4_diff) %>%
mutate(GSM484646 = GSM484646 + TS4_diff) %>%
mutate(GSM484647 = GSM484647 + TS4_diff) %>%
mutate(GSM484648 = GSM484648 + TS4_diff) %>%
dplyr::select(-TS4_diff)
#TS5 normalized
TS5_normalized<- TS5_mean_gene %>% left_join(TS1_5_average_and_diff) %>%
dplyr::select(-TS1_mean, -TS3_mean, -TS4_mean, -TS1_diff, -TS3_diff, -TS4_diff, -TS5_mean, -TS1_3_4_5_mean) %>%
rowwise() %>%
mutate(GSM484500 = GSM484500 + TS5_diff) %>%
mutate(GSM484501 = GSM484501 + TS5_diff) %>%
mutate(GSM484502 = GSM484502 + TS5_diff) %>%
mutate(GSM484503 = GSM484503 + TS5_diff) %>%
mutate(GSM484504 = GSM484504 + TS5_diff) %>%
mutate(GSM484505 = GSM484505 + TS5_diff) %>%
mutate(GSM484506 = GSM484506 + TS5_diff) %>%
mutate(GSM484507 = GSM484507 + TS5_diff) %>%
mutate(GSM484508 = GSM484508 + TS5_diff) %>%
mutate(GSM484509 = GSM484509 + TS5_diff) %>%
mutate(GSM484510 = GSM484510 + TS5_diff) %>%
mutate(GSM484511 = GSM484511 + TS5_diff) %>%
mutate(GSM484512 = GSM484512 + TS5_diff) %>%
mutate(GSM484513 = GSM484513 + TS5_diff) %>%
mutate(GSM484514 = GSM484514 + TS5_diff) %>%
mutate(GSM484515 = GSM484515 + TS5_diff) %>%
mutate(GSM484516 = GSM484516 + TS5_diff) %>%
mutate(GSM484517 = GSM484517 + TS5_diff) %>%
mutate(GSM484518 = GSM484518 + TS5_diff) %>%
mutate(GSM484519 = GSM484519 + TS5_diff) %>%
mutate(GSM484520 = GSM484520 + TS5_diff) %>%
mutate(GSM484521 = GSM484521 + TS5_diff) %>%
mutate(GSM484522 = GSM484522 + TS5_diff) %>%
mutate(GSM484523 = GSM484523 + TS5_diff) %>%
mutate(GSM484524 = GSM484524 + TS5_diff) %>%
mutate(GSM484525 = GSM484525 + TS5_diff) %>%
mutate(GSM484526 = GSM484526 + TS5_diff) %>%
mutate(GSM484527 = GSM484527 + TS5_diff) %>%
mutate(GSM484528 = GSM484528 + TS5_diff) %>%
mutate(GSM484529 = GSM484529 + TS5_diff) %>%
mutate(GSM484530 = GSM484530 + TS5_diff) %>%
mutate(GSM484531 = GSM484531 + TS5_diff) %>%
mutate(GSM484532 = GSM484532 + TS5_diff) %>%
mutate(GSM484533 = GSM484533 + TS5_diff) %>%
mutate(GSM484534 = GSM484534 + TS5_diff) %>%
mutate(GSM484535 = GSM484535 + TS5_diff) %>%
mutate(GSM484536 = GSM484536 + TS5_diff) %>%
mutate(GSM484537 = GSM484537 + TS5_diff) %>%
mutate(GSM484538 = GSM484538 + TS5_diff) %>%
mutate(GSM484539 = GSM484539 + TS5_diff) %>%
mutate(GSM484540 = GSM484540 + TS5_diff) %>%
mutate(GSM484541 = GSM484541 + TS5_diff) %>%
mutate(GSM484542 = GSM484542 + TS5_diff) %>%
mutate(GSM484543 = GSM484543 + TS5_diff) %>%
mutate(GSM484544 = GSM484544 + TS5_diff) %>%
mutate(GSM484545 = GSM484545 + TS5_diff) %>%
mutate(GSM484546 = GSM484546 + TS5_diff) %>%
mutate(GSM484547 = GSM484547 + TS5_diff) %>%
mutate(GSM484548 = GSM484548 + TS5_diff) %>%
mutate(GSM484549 = GSM484549 + TS5_diff) %>%
mutate(GSM484550 = GSM484500 + TS5_diff) %>%
dplyr::select(-TS5_diff)
#TS6 normalized
TS6_mean_diff<- TS6_mean_gene %>% left_join(TS1_5_average_and_diff) %>%
dplyr::select(-TS1_mean, -TS3_mean, -TS4_mean, -TS1_diff, -TS3_diff, -TS4_diff, -TS5_mean, -TS5_diff) %>% rowwise() %>% mutate(TS6_diff = TS1_3_4_5_mean - TS6_mean)
TS6_normalized <- TS6_mean_diff %>%
rowwise() %>%
mutate(GSM996267 = GSM996267 + TS6_diff) %>%
mutate(GSM996268 = GSM996268 + TS6_diff) %>%
mutate(GSM996269 = GSM996269 + TS6_diff) %>%
mutate(GSM996270 = GSM996270 + TS6_diff) %>%
mutate(GSM996271 = GSM996271 + TS6_diff) %>%
mutate(GSM996272 = GSM996272 + TS6_diff) %>%
mutate(GSM996273 = GSM996273 + TS6_diff) %>%
mutate(GSM996274 = GSM996274 + TS6_diff) %>%
mutate(GSM996275 = GSM996275 + TS6_diff) %>%
mutate(GSM996276 = GSM996276 + TS6_diff) %>%
mutate(GSM996277 = GSM996277 + TS6_diff) %>%
mutate(GSM996278 = GSM996278 + TS6_diff) %>%
mutate(GSM996279 = GSM996279 + TS6_diff) %>%
mutate(GSM996280 = GSM996280 + TS6_diff) %>%
mutate(GSM996281 = GSM996281 + TS6_diff) %>%
mutate(GSM996282 = GSM996282 + TS6_diff) %>%
mutate(GSM996283 = GSM996283 + TS6_diff) %>%
mutate(GSM996284 = GSM996284 + TS6_diff) %>%
mutate(GSM996285 = GSM996285 + TS6_diff) %>%
mutate(GSM996286 = GSM996286 + TS6_diff) %>%
mutate(GSM996287 = GSM996287 + TS6_diff) %>%
mutate(GSM996288 = GSM996288 + TS6_diff) %>%
mutate(GSM996289 = GSM996289 + TS6_diff) %>%
mutate(GSM996290 = GSM996290 + TS6_diff) %>%
mutate(GSM996291 = GSM996291 + TS6_diff) %>%
mutate(GSM996292 = GSM996292 + TS6_diff) %>%
mutate(GSM996293 = GSM996293 + TS6_diff) %>%
mutate(GSM996294 = GSM996294 + TS6_diff) %>%
mutate(GSM996295 = GSM996295 + TS6_diff) %>%
mutate(GSM996296 = GSM996296 + TS6_diff) %>%
mutate(GSM996297 = GSM996297 + TS6_diff) %>%
mutate(GSM996298 = GSM996298 + TS6_diff) %>%
mutate(GSM996299 = GSM996299 + TS6_diff) %>%
mutate(GSM996300 = GSM996300 + TS6_diff) %>%
mutate(GSM996311 = GSM996311 + TS6_diff) %>%
mutate(GSM996312 = GSM996312 + TS6_diff) %>%
mutate(GSM996313 = GSM996313 + TS6_diff) %>%
mutate(GSM996314 = GSM996314 + TS6_diff) %>%
mutate(GSM996315 = GSM996315 + TS6_diff) %>%
mutate(GSM996316 = GSM996316 + TS6_diff) %>%
mutate(GSM996317 = GSM996317 + TS6_diff) %>%
mutate(GSM996318 = GSM996318 + TS6_diff) %>%
mutate(GSM996319 = GSM996319 + TS6_diff) %>%
mutate(GSM996320 = GSM996320 + TS6_diff) %>%
mutate(GSM996321 = GSM996321 + TS6_diff) %>%
mutate(GSM996322 = GSM996322 + TS6_diff) %>%
mutate(GSM996323 = GSM996323 + TS6_diff) %>%
mutate(GSM996324 = GSM996324 + TS6_diff) %>%
mutate(GSM996325 = GSM996325 + TS6_diff) %>%
mutate(GSM996326 = GSM996326 + TS6_diff) %>%
mutate(GSM996327 = GSM996327 + TS6_diff) %>%
mutate(GSM996328 = GSM996328 + TS6_diff) %>%
mutate(GSM996329 = GSM996329 + TS6_diff) %>%
mutate(GSM996330 = GSM996330 + TS6_diff) %>%
mutate(GSM996331 = GSM996331 + TS6_diff) %>%
mutate(GSM996332 = GSM996332 + TS6_diff) %>%
mutate(GSM996333 = GSM996333 + TS6_diff) %>%
mutate(GSM996334 = GSM996334 + TS6_diff) %>%
mutate(GSM996335 = GSM996335 + TS6_diff) %>%
mutate(GSM996336 = GSM996336 + TS6_diff) %>%
mutate(GSM996337 = GSM996337 + TS6_diff) %>%
mutate(GSM996338 = GSM996338 + TS6_diff) %>%
mutate(GSM996339 = GSM996339 + TS6_diff) %>%
mutate(GSM996340 = GSM996340 + TS6_diff) %>%
mutate(GSM996341 = GSM996341 + TS6_diff) %>%
mutate(GSM996342 = GSM996342 + TS6_diff) %>%
mutate(GSM996343 = GSM996343 + TS6_diff) %>%
mutate(GSM996344 = GSM996344 + TS6_diff) %>%
mutate(GSM996345 = GSM996345 + TS6_diff) %>%
mutate(GSM996346 = GSM996346 + TS6_diff) %>%
mutate(GSM996347 = GSM996347 + TS6_diff) %>%
mutate(GSM996348 = GSM996348 + TS6_diff) %>%
mutate(GSM996349 = GSM996349 + TS6_diff) %>%
mutate(GSM996350 = GSM996350 + TS6_diff) %>%
mutate(GSM996351 = GSM996351 + TS6_diff) %>%
mutate(GSM996352 = GSM996352 + TS6_diff) %>%
mutate(GSM996353 = GSM996353 + TS6_diff) %>%
mutate(GSM996354 = GSM996354 + TS6_diff) %>%
mutate(GSM996355 = GSM996355 + TS6_diff) %>%
mutate(GSM996356 = GSM996356 + TS6_diff) %>%
mutate(GSM996357 = GSM996357 + TS6_diff) %>%
mutate(GSM996358 = GSM996358 + TS6_diff) %>%
mutate(GSM996359 = GSM996359 + TS6_diff) %>%
mutate(GSM996360 = GSM996360 + TS6_diff) %>%
mutate(GSM996361 = GSM996361 + TS6_diff) %>%
mutate(GSM996362 = GSM996362 + TS6_diff) %>%
mutate(GSM996363 = GSM996363 + TS6_diff) %>%
mutate(GSM996364 = GSM996364 + TS6_diff) %>%
mutate(GSM996365 = GSM996365 + TS6_diff) %>%
mutate(GSM996366 = GSM996366 + TS6_diff) %>%
mutate(GSM996367 = GSM996367 + TS6_diff) %>%
mutate(GSM996368 = GSM996368 + TS6_diff) %>%
mutate(GSM996369 = GSM996369 + TS6_diff) %>%
mutate(GSM996370 = GSM996370 + TS6_diff) %>%
mutate(GSM996371 = GSM996371 + TS6_diff) %>%
mutate(GSM996372 = GSM996372 + TS6_diff) %>%
mutate(GSM996373 = GSM996373 + TS6_diff) %>%
mutate(GSM996374 = GSM996374 + TS6_diff) %>%
mutate(GSM996375 = GSM996375 + TS6_diff) %>%
mutate(GSM996376 = GSM996376 + TS6_diff) %>%
mutate(GSM996377 = GSM996377 + TS6_diff) %>%
mutate(GSM996378 = GSM996378 + TS6_diff) %>%
mutate(GSM996379 = GSM996379 + TS6_diff) %>%
mutate(GSM996380 = GSM996380 + TS6_diff) %>%
mutate(GSM996381 = GSM996381 + TS6_diff) %>%
mutate(GSM996382 = GSM996382 + TS6_diff) %>%
mutate(GSM996383 = GSM996383 + TS6_diff) %>%
mutate(GSM996384 = GSM996384 + TS6_diff) %>%
mutate(GSM996385 = GSM996385 + TS6_diff) %>%
mutate(GSM996386 = GSM996386 + TS6_diff) %>%
mutate(GSM996387 = GSM996387 + TS6_diff) %>%
mutate(GSM996388 = GSM996388 + TS6_diff) %>%
mutate(GSM996389 = GSM996389 + TS6_diff) %>%
mutate(GSM996390 = GSM996390 + TS6_diff) %>%
mutate(GSM996391 = GSM996391 + TS6_diff) %>%
mutate(GSM996392 = GSM996392 + TS6_diff) %>%
mutate(GSM996393 = GSM996393 + TS6_diff) %>%
mutate(GSM996394 = GSM996394 + TS6_diff) %>%
mutate(GSM996395 = GSM996395 + TS6_diff) %>%
mutate(GSM996396 = GSM996396 + TS6_diff) %>%
mutate(GSM996397 = GSM996397 + TS6_diff) %>%
mutate(GSM996398 = GSM996398 + TS6_diff) %>%
mutate(GSM996399 = GSM996399 + TS6_diff) %>%
mutate(GSM996400 = GSM996400 + TS6_diff) %>%
mutate(GSM996401 = GSM996401 + TS6_diff) %>%
mutate(GSM996402 = GSM996402 + TS6_diff) %>%
mutate(GSM996403 = GSM996403 + TS6_diff) %>%
mutate(GSM996404 = GSM996404 + TS6_diff) %>%
mutate(GSM996405 = GSM996405 + TS6_diff) %>%
mutate(GSM996406 = GSM996406 + TS6_diff) %>%
mutate(GSM996407 = GSM996407 + TS6_diff) %>%
mutate(GSM996408 = GSM996408 + TS6_diff) %>%
mutate(GSM996409 = GSM996409 + TS6_diff) %>%
mutate(GSM996410 = GSM996410 + TS6_diff) %>%
mutate(GSM996411 = GSM996411 + TS6_diff) %>%
mutate(GSM996412 = GSM996412 + TS6_diff) %>%
mutate(GSM996413 = GSM996413 + TS6_diff) %>%
mutate(GSM996414 = GSM996414 + TS6_diff) %>%
mutate(GSM996415 = GSM996415 + TS6_diff) %>%
mutate(GSM996416 = GSM996416 + TS6_diff) %>%
mutate(GSM996417 = GSM996417 + TS6_diff) %>%
mutate(GSM996418 = GSM996418 + TS6_diff) %>%
mutate(GSM996419 = GSM996419 + TS6_diff) %>%
mutate(GSM996420 = GSM996420 + TS6_diff) %>%
mutate(GSM996421 = GSM996421 + TS6_diff) %>%
mutate(GSM996422 = GSM996422 + TS6_diff) %>%
mutate(GSM996423 = GSM996423 + TS6_diff) %>%
mutate(GSM996424 = GSM996424 + TS6_diff) %>%
mutate(GSM996425 = GSM996425 + TS6_diff) %>%
mutate(GSM996426 = GSM996426 + TS6_diff) %>%
mutate(GSM996427 = GSM996427 + TS6_diff) %>%
mutate(GSM996428 = GSM996428 + TS6_diff) %>%
mutate(GSM996429 = GSM996429 + TS6_diff) %>%
mutate(GSM996430 = GSM996430 + TS6_diff) %>%
mutate(GSM996431 = GSM996431 + TS6_diff) %>%
mutate(GSM996432 = GSM996432 + TS6_diff) %>%
mutate(GSM996433 = GSM996433 + TS6_diff) %>%
mutate(GSM996434 = GSM996434 + TS6_diff) %>%
mutate(GSM996435 = GSM996435 + TS6_diff) %>%
mutate(GSM996436 = GSM996436 + TS6_diff) %>%
mutate(GSM996437 = GSM996437 + TS6_diff) %>%
mutate(GSM996438 = GSM996438 + TS6_diff) %>%
mutate(GSM996439 = GSM996439 + TS6_diff) %>%
mutate(GSM996440 = GSM996440 + TS6_diff) %>%
mutate(GSM996441 = GSM996441 + TS6_diff) %>%
mutate(GSM996442 = GSM996442 + TS6_diff) %>%
mutate(GSM996443 = GSM996443 + TS6_diff) %>%
mutate(GSM996444 = GSM996444 + TS6_diff) %>%
mutate(GSM996445 = GSM996445 + TS6_diff) %>%
mutate(GSM996446 = GSM996446 + TS6_diff) %>%
mutate(GSM996447 = GSM996447 + TS6_diff) %>%
mutate(GSM996448 = GSM996448 + TS6_diff) %>%
mutate(GSM996449 = GSM996449 + TS6_diff) %>%
mutate(GSM996450 = GSM996450 + TS6_diff) %>%
mutate(GSM996451 = GSM996451 + TS6_diff) %>%
mutate(GSM996452 = GSM996452 + TS6_diff) %>%
mutate(GSM996453 = GSM996453 + TS6_diff) %>%
mutate(GSM996454 = GSM996454 + TS6_diff) %>%
mutate(GSM996455 = GSM996455 + TS6_diff) %>%
mutate(GSM996456 = GSM996456 + TS6_diff) %>%
mutate(GSM996457 = GSM996457 + TS6_diff) %>%
mutate(GSM996458 = GSM996458 + TS6_diff) %>%
mutate(GSM996459 = GSM996459 + TS6_diff) %>%
mutate(GSM996460 = GSM996460 + TS6_diff) %>%
mutate(GSM996461 = GSM996461 + TS6_diff) %>%
mutate(GSM996462 = GSM996462 + TS6_diff) %>%
mutate(GSM996463 = GSM996463 + TS6_diff) %>%
mutate(GSM996464 = GSM996464 + TS6_diff) %>%
mutate(GSM996465 = GSM996465 + TS6_diff) %>%
mutate(GSM996466 = GSM996466 + TS6_diff) %>%
mutate(GSM996467 = GSM996467 + TS6_diff) %>%
mutate(GSM996468 = GSM996468 + TS6_diff) %>%
mutate(GSM996469 = GSM996469 + TS6_diff) %>%
mutate(GSM996470 = GSM996470 + TS6_diff) %>%
mutate(GSM996471 = GSM996471 + TS6_diff) %>%
mutate(GSM996472 = GSM996472 + TS6_diff) %>%
mutate(GSM996473 = GSM996473 + TS6_diff) %>%
mutate(GSM996474 = GSM996474 + TS6_diff) %>%
mutate(GSM996475 = GSM996475 + TS6_diff) %>%
mutate(GSM996476 = GSM996476 + TS6_diff) %>%
mutate(GSM996477 = GSM996477 + TS6_diff) %>%
mutate(GSM996478 = GSM996478 + TS6_diff) %>%
mutate(GSM996479 = GSM996479 + TS6_diff) %>%
mutate(GSM996480 = GSM996480 + TS6_diff) %>%
dplyr::select(-TS6_diff, -TS1_3_4_5_mean, -TS6_mean)
#normalized data with only sample IDs as columns
#TS1 normalized
TS1_normalized.1 <- TS1_normalized %>%`row.names<-`(., NULL) %>% column_to_rownames(var = "Gene")
#TS3 normalized
TS3_normalized.1 <- TS3_normalized %>%`row.names<-`(., NULL) %>% column_to_rownames(var = "Gene")
#TS4 normalized
TS4_normalized.1 <- TS4_normalized %>%`row.names<-`(., NULL) %>% column_to_rownames(var = "Gene")
#TS5 normalized
TS5_normalized.1 <- TS5_normalized %>%`row.names<-`(., NULL) %>% column_to_rownames(var = "Gene")
#TS6 normalized
TS6_normalized.1 <- TS6_normalized %>%`row.names<-`(., NULL) %>% column_to_rownames(var = "Gene")
Pheno_TS1_new <- Pheno_TS1 %>% filter(`illness:ch1` == "Control" | `time post initiation of treatment:ch1` == "2_months")
#TS1 and TS2 for analysis data
Pheno_TS2_part1 <- Pheno_TS1 %>% filter(`time post initiation of treatment:ch1` == "0_months" & `illness:ch1` == "PTB")
Pheno_TS2_part2 <- Pheno_TS1 %>% filter(`time post initiation of treatment:ch1` == "12_months")
Pheno_TS2 <- rbind(Pheno_TS2_part1, Pheno_TS2_part2)
Pheno_TS1_new_sampleid.df <- Pheno_TS1_new %>% tibble::rownames_to_column("Sample_ID")
Pheno_TS1_new_sampleid_vector <- Pheno_TS1_new_sampleid.df[, "Sample_ID"]
Pheno_TS1_new_sampleid <- dput(as.character(Pheno_TS1_new_sampleid_vector))
Expression_TS1_normalized_new <- TS1_normalized.1[, colnames(TS1_normalized.1) %in% Pheno_TS1_new_sampleid]
Pheno_TS2_sampleid.df <- Pheno_TS2 %>% tibble::rownames_to_column("Sample_ID")
Pheno_TS2_sampleid_vector <- Pheno_TS2_sampleid.df[, "Sample_ID"]
Pheno_TS2_sampleid <- dput(as.character(Pheno_TS2_sampleid_vector))
Expression_TS2_normalized <- TS1_normalized.1[, colnames(TS1_normalized.1) %in% Pheno_TS2_sampleid]
Principal Component Analysis(PCA) was conducted on the training set and TS1 to cluster the data for the presence of TB. The sensitivity, specificity and accuracy were found and a receiving operating characteristic(ROC) curve was produced with the area under the curve(AUC) calculated for the training set and all the test sets.
fc_paper_20 <- c(1.402, 1.172, 1.248, -1.260, 1.492, 1.422, 1.374, 1.256, 1.441, 1.336, 1.454, 1.102, 1.255, -1.586, 1.421, -1.234, 1.131, 1.253, -3.149, 1.275)
Paper_top20genes <- read.csv("Paper_top20genes.csv") %>% dplyr::rename(Rank=1, `Gene Name` = Gene.Name, `Fold Change (HIV+TB)/HIV` = Fold.Change..HIV.TB..HIV) %>% transmute(Rank, Accession, Symbol, `Gene Name`, `Fold Change (HIV+TB)/HIV` = fc_paper_20)
kable(Paper_top20genes, caption = "Table 1.The Published Paper's Top 20 Genes from the 251 Gene Signature Identified Using SVM-RFE")
| Rank | Accession | Symbol | Gene Name | Fold Change (HIV+TB)/HIV |
|---|---|---|---|---|
| 1 | NM_021004 | DHRS4 | dehydrogenase/reductase (SDR family) member 4 | 1.402 |
| 2 | NM_001990 | EYA3 | eyes absent homolog 3 (Drosophila) | 1.172 |
| 3 | NM_030629 | CMIP | c-Maf-inducing protein | 1.248 |
| 4 | NM_014233 | UBTF | upstream binding transcription factor, RNA polymerase I | -1.260 |
| 5 | NM_032227 | TMEM164 | transmembrane protein 164 | 1.492 |
| 6 | NM_078487 | CDKN2B | cyclin-dependent kinase inhibitor 2B (p15, inhibits CDK4) | 1.422 |
| 7 | NM_005607 | PTK2 | PTK2 protein tyrosine kinase 2 | 1.374 |
| 8 | XM_939593 | LOC648605 | PREDICTED: similar to Trimethyllysine dioxygenase, mitochondrial precursor | 1.256 |
| 9 | CR621233 | NaN | full-length cDNA clone CS0DI057YA22 of Placenta Cot25-normalized of (human) | 1.441 |
| 10 | NM_003501 | ACOX3 | acyl-Coenzyme A oxidase 3, pristanoyl | 1.336 |
| 11 | XM_001126647 | MLKL | PREDICTED: mixed lineage kinase domain-like | 1.454 |
| 12 | NM_007100 | ATP5I | ATP synthase, H+ transporting, mitochondrial F0 complex, subunit E | 1.102 |
| 13 | NM_005819 | STX6 | syntaxin 6 | 1.255 |
| 14 | NM_152858 | WTAP | Wilms tumor 1 associated protein | -1.586 |
| 15 | NM_001531 | MR1 | major histocompatibility complex, class I-related | 1.421 |
| 16 | XM_045290 | LOC151579 | PREDICTED: similar to basic leucine zipper and W2 domains 1 | -1.234 |
| 17 | NM_025164 | SIK3 | SIK family kinase 3 | 1.131 |
| 18 | XM_036729 | USP41 | PREDICTED: ubiquitin specific peptidase 41 | 1.253 |
| 19 | NM_145172 | WDR63 | WD repeat domain 63 | -3.149 |
| 20 | NM_023015 | INTS3 | integrator complex subunit 3 | 1.275 |
# Generating dataset for SVM
Expression_data <- t(Expression_cut) %>% as.data.frame()
Class_data <- factor(Pheno_cut$description) %>% as.data.frame()
colnames(Class_data) = "Class"
svm_data <- cbind(Expression_data, Class_data)
#finding best tuning parameters
#set.seed(1)
#tune_svm <- tune.svm(Pheno_cut_description~., data=svm_data, cost=seq(1,2,by=0.1), kernel="linear")
#tune_svm$best.parameters
#summary(tune_svm)
#SVM-RFE
set.seed(3002)
n = 1000
df = svm_data
while (n > 280) {
fit = svm(Class ~ ., data = df, kernel = "linear", cost = 1)
weights = t(fit$coefs) %*% fit$SV %>% as.data.frame()
absolute_weights_sorted <-
t(sort(abs(weights), decreasing = T)) %>%
as.data.frame() %>%
dplyr::rename(weights=V1) %>%
tibble::rownames_to_column("Gene")
n = round(0.9*nrow(absolute_weights_sorted),0)
print(n)
if (n < 290) {
Genes_251 <- absolute_weights_sorted %>% arrange(desc(weights)) %>% head(251)
keep_genes <- Genes_251$Gene
}
else {
eliminate10 <- absolute_weights_sorted[1:n,]
keep_genes <- eliminate10$Gene
}
df_genes <- df[,colnames(df) %in% keep_genes]
df <- cbind(df_genes, Class_data)
}
Genes251 <- read.csv("251_gene_names.csv") %>% unlist() %>% as.vector()
foldchangevalues <- read.csv("Origin_genes_withfc.csv")
keep_genes <- dput(Genes251)
Expression_251 <- Expression_cut[rownames(Expression_cut) %in% Genes251,]
Gene_251_fdata_fortable <- Genes[rownames(Genes) %in% keep_genes,] %>% dplyr::select(Accession, Symbol, Definition)
Gene_251_fc <- foldchangevalues%>%filter(Gene %in% keep_genes)
Top20genes_table <- cbind(Gene_251_fdata_fortable, Gene_251_fc) %>% transmute(Rank =c(1:251), Accession, Symbol, `Gene Name` = Definition, `Fold Change` = Fold_Change) %>% mutate(`Fold Change` = round(`Fold Change`,3)) %>% head(20)
rownames(Top20genes_table) <- NULL
kable(Top20genes_table, caption = "Table 2.Top 20 Genes from the 251 Gene Signature Identified Using SVM-RFE from Replication")
| Rank | Accession | Symbol | Gene Name | Fold Change |
|---|---|---|---|---|
| 1 | NM_005610.1 | RBBP4 | Homo sapiens retinoblastoma binding protein 4 (RBBP4), mRNA. | 2.087 |
| 2 | XM_944654.2 | LOC648716 | PREDICTED: Homo sapiens hypothetical LOC648716, transcript variant 2 (LOC648716), mRNA. | 1.583 |
| 3 | NM_145172.2 | WDR63 | Homo sapiens WD repeat domain 63 (WDR63), mRNA. | 1.432 |
| 4 | NM_032873.3 | STS-1 | Homo sapiens Cbl-interacting protein Sts-1 (STS-1), mRNA. | 1.832 |
| 5 | NM_152758.4 | YTHDF3 | Homo sapiens YTH domain family, member 3 (YTHDF3), mRNA. | 1.697 |
| 6 | NM_000155.2 | GALT | Homo sapiens galactose-1-phosphate uridylyltransferase (GALT), mRNA. | 1.854 |
| 7 | NM_012228.2 | MSRB2 | Homo sapiens methionine sulfoxide reductase B2 (MSRB2), mRNA. | 3.928 |
| 8 | NM_001273.2 | CHD4 | Homo sapiens chromodomain helicase DNA binding protein 4 (CHD4), mRNA. | 1.267 |
| 9 | NM_007233.1 | TP53AP1 | Homo sapiens TP53 activated protein 1 (TP53AP1), mRNA. | 1.712 |
| 10 | NM_006977.2 | ZBTB25 | Homo sapiens zinc finger and BTB domain containing 25 (ZBTB25), mRNA. | 2.008 |
| 11 | NM_023914.2 | P2RY13 | Homo sapiens purinergic receptor P2Y, G-protein coupled, 13 (P2RY13), transcript variant 1, mRNA. | 7.779 |
| 12 | NM_001002255.1 | SUMO4 | Homo sapiens SMT3 suppressor of mif two 3 homolog 4 (S. cerevisiae) (SUMO4), mRNA. | 1.539 |
| 13 | XM_045290.6 | LOC151579 | PREDICTED: Homo sapiens similar to basic leucine zipper and W2 domains 1 (LOC151579), mRNA. | 1.154 |
| 14 | NM_012179.3 | FBXO7 | Homo sapiens F-box protein 7 (FBXO7), transcript variant 1, mRNA. | 1.739 |
| 15 | NM_032376.2 | TMEM101 | Homo sapiens transmembrane protein 101 (TMEM101), mRNA. | 1.490 |
| 16 | XM_001126647.1 | MLKL | PREDICTED: Homo sapiens mixed lineage kinase domain-like (MLKL), mRNA. | 1.698 |
| 17 | NM_024407.3 | NDUFS7 | Homo sapiens NADH dehydrogenase (ubiquinone) Fe-S protein 7, 20kDa (NADH-coenzyme Q reductase) (NDUFS7), mRNA. | 1.709 |
| 18 | NM_005119.2 | THRAP3 | Homo sapiens thyroid hormone receptor associated protein 3 (THRAP3), mRNA. | 1.382 |
| 19 | NM_025125.2 | C10orf57 | Homo sapiens chromosome 10 open reading frame 57 (C10orf57), mRNA. | 4.426 |
| 20 | NM_017437.1 | CPSF2 | Homo sapiens cleavage and polyadenylation specific factor 2, 100kDa (CPSF2), mRNA. | 1.647 |
Overlapping_genes <- Paper_top20genes %>% inner_join(Top20genes_table, by="Symbol") %>% rename_at(vars(matches(".x")), ~ sub(".x", "_Paper", .x)) %>% rename_at(vars(matches(".y")), ~ sub(".y", "_Replication", .x))
Overlapping_genes_table <- Overlapping_genes %>% dplyr::rename(Symbol =3, `Fold Change_Paper` = `Fold Change (HIV+TB)/HIV`,`Fold Change_Replication` = `Fold Change`) %>% relocate(Symbol, .before=Rank_Paper)
kable(Overlapping_genes_table, caption = "Table 3.Overlapping Genes of the Top 20 Genes from the 251 Gene Signature Identified Using SVM-RFE from the Published Paper and Replication")
| Symbol | Rank_Paper | Accession_Paper | Gene Name_Paper | Fold Change_Paper | Rank_Replication | Accession_Replication | Gene Name_Replication | Fold Change_Replication |
|---|---|---|---|---|---|---|---|---|
| MLKL | 11 | XM_001126647 | PREDICTED: mixed lineage kinase domain-like | 1.454 | 16 | XM_001126647.1 | PREDICTED: Homo sapiens mixed lineage kinase domain-like (MLKL), mRNA. | 1.698 |
| LOC151579 | 16 | XM_045290 | PREDICTED: similar to basic leucine zipper and W2 domains 1 | -1.234 | 13 | XM_045290.6 | PREDICTED: Homo sapiens similar to basic leucine zipper and W2 domains 1 (LOC151579), mRNA. | 1.154 |
| WDR63 | 19 | NM_145172 | WD repeat domain 63 | -3.149 | 3 | NM_145172.2 | Homo sapiens WD repeat domain 63 (WDR63), mRNA. | 1.432 |
Genes_copy <- Genes
paper_251genes <- read.csv("Paper_251_genes.csv") %>% dplyr::rename(Accession=1) %>% filter(row_number() %in% c(1:251))
Genes_copy$Search_Key = sub("\\..*", "", Genes_copy$Search_Key)
#Genes paper 251 data
Gene_251_paper_fdata <- Genes_copy %>% inner_join(paper_251genes, by=c("Search_Key" = "Accession", "Symbol")) %>% filter(row_number() %in% c(1:251))
keep_genes_paper <- dput(Gene_251_paper_fdata %>% dplyr::select(ID) %>% unlist() %>% as.vector())
#Expression paper 251 data
Expression_251_paper <- Expression_cut[rownames(Expression_cut) %in% keep_genes_paper,]
#svm data origin
Expression_251_origin_t <- as.data.frame(t(Expression_251_paper))
Pheno_251_origin_disease <- factor(Pheno_cut$description, levels = c("HIV/TB co-infection", "HIV"), labels = c("HIV/TB co-infection", "HIV"))
svm_data.10_new <- cbind(Expression_251_origin_t, Pheno_251_origin_disease)
#svm data TS1
Expression_TS1_t <- as.data.frame(t(Expression_TS1_normalized_new))
Pheno_TS1_treatmenttime <- factor(Pheno_TS1_new$`time post initiation of treatment:ch1`)
svm_data_TS1 <- cbind(Expression_TS1_t, Pheno_TS1_treatmenttime)
svm_data_TS1_251_draft <- svm_data_TS1[,colnames(svm_data_TS1) %in% keep_genes_paper]
svm_data_TS1_251 <- cbind(svm_data_TS1_251_draft, Pheno_TS1_treatmenttime) %>% dplyr::select(-ILMN_1738075)
#svm data TS2
Expression_TS2_t <- as.data.frame(t(Expression_TS2_normalized))
Pheno_TS2_treatmenttime <- factor(Pheno_TS2$`time post initiation of treatment:ch1`)
svm_data_TS2 <- cbind(Expression_TS2_t, Pheno_TS2_treatmenttime)
svm_data_TS2_251 <- svm_data_TS2[,colnames(svm_data_TS2) %in% keep_genes_paper] %>% cbind(Pheno_TS2_treatmenttime) %>% dplyr::select(-ILMN_1738075)
#svm data TS3
Pheno_TS3_illnessupdate <- Pheno_TS3 %>% dplyr::rename(illness = `illness:ch1`) %>% mutate(illness = case_when(illness == "Latent" ~ "Controls/Latent TB", illness == "Control(BCG-)" ~ "Controls/Latent TB", illness == "PTB" ~ "TB", illness== "Control (BCG+)" ~ "Controls/Latent TB"))
Expression_TS3_t <- as.data.frame(t(TS3_normalized.1))
Pheno_TS3_illness <- factor(Pheno_TS3_illnessupdate$illness)
svm_data_TS3 <- cbind(Expression_TS3_t, Pheno_TS3_illness)
svm_data_TS3_251 <- svm_data_TS3[,colnames(svm_data_TS3) %in% keep_genes_paper] %>% cbind(Pheno_TS3_illness)
#svm data TS4
Pheno_TS4_illnessupdate <- Pheno_TS4 %>% dplyr::rename(illness = characteristics_ch1.3) %>% mutate(illness = case_when(illness == "illness: Control (BCG+)" ~ "Controls/Latent TB", illness == "illness: Latent" ~ "Controls/Latent TB", illness == "illness: PTB" ~ "TB"))
Expression_TS4_t <- as.data.frame(t(TS4_normalized.1))
Pheno_TS4_illness <- factor(Pheno_TS4_illnessupdate$illness)
svm_data_TS4 <- cbind(Expression_TS4_t, Pheno_TS4_illness)
svm_data_TS4_251 <- svm_data_TS4[,colnames(svm_data_TS4) %in% keep_genes_paper] %>% cbind(Pheno_TS4_illness) %>% dplyr::select(-ILMN_1738075)
#svm data TS5
Pheno_TS5_illnessupdate <- Pheno_TS5 %>% dplyr::rename(illness = `illness:ch1`) %>% mutate(illness = case_when(illness == "LATENT TB" ~ "Latent TB", illness == "PTB" ~ "TB"))
Expression_TS5_t <- as.data.frame(t(TS5_normalized.1))
Pheno_TS5_illness <- factor(Pheno_TS5_illnessupdate$illness)
svm_data_TS5 <- cbind(Expression_TS5_t, Pheno_TS5_illness)
svm_data_TS5_251 <- svm_data_TS5[,colnames(svm_data_TS5) %in% keep_genes_paper] %>% cbind(Pheno_TS5_illness) %>% dplyr::select(-ILMN_1738075)
#svm data TS6
Pheno_TS6_illnessupdate <- Pheno_TS6 %>% dplyr::rename(illness = characteristics_ch1) %>% mutate(illness = case_when(illness == "disease: LTB" ~ "Latent TB", illness == "disease: PTB" ~ "TB"))
Expression_TS6_t <- as.data.frame(t(TS6_normalized.1))
Pheno_TS6_illness <- factor(Pheno_TS6_illnessupdate$illness)
svm_data_TS6 <- cbind(Expression_TS6_t, Pheno_TS6_illness)
svm_data_TS6_251 <- svm_data_TS6[,colnames(svm_data_TS6) %in% keep_genes_paper] %>% cbind(Pheno_TS6_illness)
#common genes for all datasets
origin_old <- t(svm_data.10_new) %>% as.data.frame() %>% tibble::rownames_to_column("Gene")
TS1_old <- t(svm_data_TS1_251) %>% as.data.frame() %>% tibble::rownames_to_column("Gene")
TS2_old <- t(svm_data_TS2_251) %>% as.data.frame() %>% tibble::rownames_to_column("Gene")
TS3_old <- t(svm_data_TS3_251) %>% as.data.frame() %>% tibble::rownames_to_column("Gene")
TS4_old <- t(svm_data_TS4_251) %>% as.data.frame() %>% tibble::rownames_to_column("Gene")
TS5_old <- t(svm_data_TS5_251) %>% as.data.frame() %>% tibble::rownames_to_column("Gene")
TS6_old <- t(svm_data_TS6_251) %>% as.data.frame() %>% tibble::rownames_to_column("Gene")
Common_genes_datasets <- origin_old %>% inner_join(TS1_old) %>%
inner_join(TS2_old) %>%
inner_join(TS3_old) %>%
inner_join(TS4_old) %>%
inner_join(TS5_old) %>%
inner_join(TS6_old) %>%
dplyr::select(Gene)
Common_genes_dataset_vector <- dput(Common_genes_datasets$Gene)
#new svm data
svm_data_origin_common <- svm_data.10_new[, colnames(svm_data.10_new) %in% Common_genes_dataset_vector]
TS1_data_origin_common <- svm_data_TS1_251[, colnames(svm_data_TS1_251) %in% Common_genes_dataset_vector]
TS2_data_origin_common <- svm_data_TS2_251[, colnames(svm_data_TS2_251) %in% Common_genes_dataset_vector]
TS3_data_origin_common <- svm_data_TS3_251[, colnames(svm_data_TS3_251) %in% Common_genes_dataset_vector]
TS4_data_origin_common <- svm_data_TS4_251[, colnames(svm_data_TS4_251) %in% Common_genes_dataset_vector]
TS5_data_origin_common <- svm_data_TS5_251[, colnames(svm_data_TS5_251) %in% Common_genes_dataset_vector]
TS6_data_origin_common <- svm_data_TS6_251[, colnames(svm_data_TS6_251) %in% Common_genes_dataset_vector]
#svm score origin
fit_origin_new <- svm(Pheno_251_origin_disease ~ ., data=svm_data_origin_common, kernel="linear", cost=1)
svm_score_origin_new <- predict(fit_origin_new, svm_data_origin_common, decision.values = TRUE)
SVM_Scores_Origin_New<- read.csv("SVM_Scores_Origin_New.csv")%>% dplyr::rename(Sample_ID = 1) %>% mutate(Predicted_Illness = case_when(SVM_Score <0 ~ "HIV/TB co-infection", SVM_Score>=0 ~ "HIV")) %>% left_join(Pheno,by=c("Sample_ID" = "geo_accession")) %>% dplyr::select(Sample_ID, SVM_Score, Predicted_Illness, description) %>% dplyr::rename(illness = description)
colour_graph <- c( rep("#fafcfb", 22), rep("#636363",21))
ggplot_svm_origin <- ggplot(SVM_Scores_Origin_New, aes(x= reorder(Sample_ID, -SVM_Score), y=SVM_Score)) + geom_col(stat = "identity", fill = colour_graph, colour="black") + labs(title = "Training Set", y = "SVM Score") + theme_minimal()+ theme(plot.title=element_text(hjust=0.5, size=15), axis.title.x = element_text(hjust = 0.5), axis.title.y = element_text(size=15)) + theme(axis.title.x=element_blank(),axis.text.x=element_blank(),axis.ticks.x=element_blank())
figure_paper_svm_training <- include_graphics("Paper_SVM_Training.png")
figure_svm_training <- include_graphics("ggplot_svm_training.png")
include_graphics(c(figure_paper_svm_training, figure_svm_training))
Figure 2. Support Vector Machine Scores for the Training Set. Left: The Published Paper. Right: Replication Result.
pal <- colorpanel(32, "blue", "white", "red")
#heatmap(as.matrix(log(Expression_251_paper)), col = pal, labRow=FALSE, labCol = FALSE, xlab="Sample", ylab="Genes")
figure_paper_heatmap<- include_graphics("Paper_Heatmap.png")
figure_heatmap <- include_graphics("plot_heatmap.png")
include_graphics(c(figure_paper_heatmap, figure_heatmap))
Figure 3. Heatmap based on the Gene Expression of the Training Set. Left: The Published Paper. Right: Replication Result.
#PCA genes paper
PCA_genes_paper <- Gene_251_paper_fdata %>% dplyr::select(ID, Symbol)
#PCA counts paper
PCA_counts_TS1_paper <- as.data.frame(Expression_TS1)[rownames(as.data.frame(Expression_TS1)) %in% keep_genes_paper,]
PCA_counts_TS1_genes.df_paper <- PCA_counts_TS1_paper %>% tibble::rownames_to_column("Gene")
PCA_counts_TS1_vector_paper <- PCA_counts_TS1_genes.df_paper [, "Gene"]
PCA_counts_TS1_genes_paper <- dput(as.character(PCA_counts_TS1_vector_paper))
PCA_counts_Origin_paper <- Expression_251_paper[rownames(Expression_251_paper) %in% PCA_counts_TS1_genes_paper,]
PCA_counts_paper <- cbind(PCA_counts_Origin_paper, PCA_counts_TS1_paper)
#PCA samples paper
PCA_pheno_origin <- Pheno_cut %>% dplyr::select(`disease state:ch1`) %>% dplyr::rename(disease_state = `disease state:ch1`) %>% mutate(disease_state = case_when(disease_state == "HIV only" ~ "HIV", disease_state == "HIV/TB" ~ "HIV/MTB")) %>% mutate(Data = "Origin") %>% dplyr::rename(Disease_State = disease_state)
PCA_pheno_TS1 <- Pheno_TS1 %>% dplyr::select(`time post initiation of treatment:ch1`, `illness:ch1`) %>% dplyr::rename(disease_state = `time post initiation of treatment:ch1`) %>% mutate(disease_state = case_when(disease_state == "0_months" & `illness:ch1` == "Control" ~ "Controls", disease_state == "0_months" & `illness:ch1` == "PTB" ~ "TB", disease_state == "2_months" ~ "TB (2Mo)", disease_state =="12_months" ~ "TB (12Mo)")) %>% dplyr::select(disease_state) %>% mutate(Data = "TS1") %>% dplyr::rename(Disease_State = disease_state)
PCA_samples <- PCA_pheno_origin %>% rbind(PCA_pheno_TS1)
#counts data - use PCA_counts_paper_1
has_neg <- apply(PCA_counts_paper, 1, function(row) any(row < 0))
genes_with_negative_counts_df <- which(has_neg) %>% as.data.frame() %>% tibble::rownames_to_column("Gene")
length(which(has_neg))
genes_with_negative_counts_vector <- genes_with_negative_counts_df [, "Gene"]
genes_with_negative_counts <- dput(as.character(genes_with_negative_counts_vector))
PCA_counts_paper_1 <- PCA_counts_paper[!rownames(PCA_counts_paper) %in% genes_with_negative_counts,]
#genes data - use PCA_genes_paper_2
PCA_counts_paper_gene_df <- PCA_counts_paper_1 %>% tibble::rownames_to_column("Gene")
PCA_counts_paper_gene_vector <- PCA_counts_paper_gene_df[, "Gene"]
PCA_counts_paper_gene <- dput(as.character(PCA_counts_paper_gene_vector))
PCA_genes_paper_1 <- PCA_genes_paper %>% distinct(ID, .keep_all=TRUE)%>% mutate(ID_1 = ID) %>% column_to_rownames(var="ID") %>% dplyr::rename(ID= ID_1) %>% relocate(ID, .before=Symbol)
PCA_genes_paper_2 <- PCA_genes_paper_1[rownames(PCA_genes_paper_1) %in% PCA_counts_paper_gene,]
#samples data - use PCA_samples_paper
PCA_samples_paper_sampleid <- dput(as.character(colnames(PCA_counts_paper)))
PCA_samples_paper <- PCA_samples[rownames(PCA_samples) %in% PCA_samples_paper_sampleid,]
#plot PCA
dgelist <- DGEList(counts = PCA_counts_paper_1, samples = PCA_samples_paper, genes = PCA_genes_paper_2)
normalizationfactors <- calcNormFactors(dgelist, method = "TMM")
keep <- filterByExpr(normalizationfactors)
keep.1 <- normalizationfactors[keep,]
cpm <- cpm(dgelist)
lcpm <- cpm(dgelist, log = T)
pca <- prcomp(t(lcpm), scale=T)
pcaplot <- autoplot(pca, data = PCA_samples_paper, colour = "Disease_State", shape = "Data")
figure_paper_pca <-include_graphics("Paper_PCA.png")
figure_pca <- include_graphics("plot_pca.png")
include_graphics(c(figure_paper_pca, figure_pca))
Figure 4. Principal Component Analysis of the Training Set and Test Set 1. Left: The Published Paper. Right: Replication Result.
svm_score_TS1 <- predict(fit_origin_new, TS1_data_origin_common, decision.values = TRUE)
SVM_Scores_TS1 <- read.csv("SVM_Scores_TS1_New.csv")%>% dplyr::rename(Sample_ID = 1) %>% mutate(Predicted_Illness = case_when(SVM_Score <0 ~ "TB (2 mo)", SVM_Score>=0 ~ "Controls")) %>% left_join(Pheno_TS1_new,by=c("Sample_ID" = "geo_accession")) %>% dplyr::select(Sample_ID, SVM_Score, Predicted_Illness, `time post initiation of treatment:ch1`) %>% dplyr::rename(illness = `time post initiation of treatment:ch1`) %>% mutate(illness= case_when(illness=="0_months" ~ "Controls", illness == "2_months" ~ "TB (2 mo)"))
colour_graph_TS1 <- c(rep("#fafcfb",12), rep("#636363", 7))
plot_TS1<- ggplot(SVM_Scores_TS1, aes(x= reorder(Sample_ID, -SVM_Score), y=SVM_Score)) + geom_col(stat = "identity", fill = colour_graph_TS1, colour="black") + theme_minimal()+ labs(title = "Test Set 1", y = "SVM Score") + theme(plot.title=element_text(hjust=0.5, size=15), axis.title.x = element_text(hjust = 0.5), axis.title.y = element_text(size=15)) + theme(axis.title.x=element_blank(),axis.text.x=element_blank(),axis.ticks.x=element_blank())
figure_paper_svm_ts1 <- include_graphics("Paper_SVM_TS1.png")
figure_svm_ts1 <- include_graphics("plot_svm_ts1.png")
include_graphics(c(figure_paper_svm_ts1, figure_svm_ts1))
Figure 5. Support Vector Machine Scores for Test Set 1. Left: The Published Paper. Right: Replication Result.
svm_score_TS2 <- predict(fit_origin_new, TS2_data_origin_common, decision.values = TRUE)
SVM_Scores_TS2<- read.csv("SVM_Scores_TS2_New.csv")%>% dplyr::rename(Sample_ID = 1)%>% mutate(Predicted_Illness = case_when(SVM_Score <0 ~ "TB (12 mo)", SVM_Score>=0 ~ "TB")) %>% left_join(Pheno_TS2,by=c("Sample_ID" = "geo_accession")) %>% dplyr::select(Sample_ID, SVM_Score, Predicted_Illness, `time post initiation of treatment:ch1`) %>% dplyr::rename(illness = `time post initiation of treatment:ch1`) %>% mutate(illness= case_when(illness=="0_months" ~ "TB", illness == "12_months" ~ "TB (12 mo)"))
colour_graph_TS2 <- c(rep("#fafcfb",1), rep("#636363", 1),rep("#fafcfb",1),rep("#636363", 1),rep("#fafcfb",3),rep("#636363", 3), rep("#fafcfb", 1), rep("#636363", 1),rep("#fafcfb", 1), rep("#636363", 1))
plot_ts2 <- ggplot(SVM_Scores_TS2, aes(x= reorder(Sample_ID, -SVM_Score), y=SVM_Score)) + geom_col(stat = "identity", fill = colour_graph_TS2, colour="black") + theme_minimal() + labs(title = "Test Set 2", y = "SVM Score") + theme(plot.title=element_text(hjust=0.5, size=15), axis.title.x = element_text(hjust = 0.5), axis.title.y = element_text(size=15)) + theme(axis.title.x=element_blank(),axis.text.x=element_blank(),axis.ticks.x=element_blank())
figure_paper_svm_ts2 <- include_graphics("Paper_SVM_TS2.png")
figure_svm_ts2 <- include_graphics("plot_ts2.png")
include_graphics(c(figure_paper_svm_ts2, figure_svm_ts2))
Figure 6. Support Vector Machine Scores for Test Set 2. Left: The Published Paper. Right: Replication Result.
#svm TS3
svm_score_TS3 <- predict(fit_origin_new, TS3_data_origin_common, decision.values = TRUE)
SVM_Scores_TS3<- read.csv("SVM_Scores_TS3_New.csv")%>% dplyr::rename(Sample_ID = 1) %>% mutate(Predicted_Illness =case_when(SVM_Score <0 ~ "TB", SVM_Score>=0 ~ "Controls/Latent TB")) %>% left_join(Pheno_TS3_illnessupdate,by=c("Sample_ID" = "geo_accession")) %>% dplyr::select(Sample_ID, SVM_Score, Predicted_Illness, illness) %>% na.omit()
#svm TS4
svm_score_TS4 <- predict(fit_origin_new, TS4_data_origin_common, decision.values = TRUE)
SVM_Scores_TS4<- read.csv("SVM_Scores_TS4_New.csv")%>% dplyr::rename(Sample_ID = 1) %>% mutate(Predicted_Illness = case_when(SVM_Score <0 ~ "Controls/Latent TB ", SVM_Score>=0 ~ "TB")) %>% left_join(Pheno_TS4_illnessupdate,by=c("Sample_ID" = "geo_accession")) %>% dplyr::select(Sample_ID, SVM_Score, Predicted_Illness, illness)
#svm TS5
svm_score_TS5 <- predict(fit_origin_new, TS5_data_origin_common, decision.values = TRUE)
SVM_Scores_TS5<- read.csv("SVM_Scores_TS5_New.csv")%>% dplyr::rename(Sample_ID = 1) %>% mutate(Predicted_Illness = case_when(SVM_Score <0 ~ "Latent TB", SVM_Score>=0 ~ "TB")) %>% left_join(Pheno_TS5_illnessupdate,by=c("Sample_ID" = "geo_accession")) %>% dplyr::select(Sample_ID, SVM_Score, Predicted_Illness, illness)
#svm TS6
svm_score_TS6 <- predict(fit_origin_new, TS6_data_origin_common, decision.values = TRUE)
SVM_Scores_TS6<- read.csv("SVM_Scores_TS6_New.csv")%>% dplyr::rename(Sample_ID = 1) %>% mutate(Predicted_Illness = case_when(SVM_Score <0 ~ "Latent TB", SVM_Score>=0 ~ "TB")) %>% left_join(Pheno_TS6_illnessupdate,by=c("Sample_ID" = "geo_accession")) %>% dplyr::select(Sample_ID, SVM_Score, Predicted_Illness, illness)
#roc curve training
#Origin
SVM_Scores_Origin_df <- SVM_Scores_Origin_New %>% mutate(predictions = case_when(SVM_Score <0 ~ 0, SVM_Score >=0 ~1)) %>% mutate(labels = case_when(illness == "HIV" ~ 1, illness == "HIV/TB co-infection" ~ 0))
SVM_Scores_Origin_pred <- prediction(SVM_Scores_Origin_df$predictions, SVM_Scores_Origin_df$labels)
SVM_Scores_Origin_perf <- performance(SVM_Scores_Origin_pred,"tpr","fpr")
#plot(SVM_Scores_Origin_perf,col='red', main="Training Set", lwd=3)
auc_ROCR_Origin <- performance(SVM_Scores_Origin_pred, measure = "auc")
auc_ROCR_Origin <- auc_ROCR_Origin@y.values[[1]]
SVM_Scores_TS1_df <- SVM_Scores_TS1 %>% mutate(predictions = case_when(SVM_Score <0 ~ 0, SVM_Score >=0 ~1)) %>% mutate(labels = case_when(illness == "Controls" ~ 1, illness == "TB (2 mo)" ~ 0))
SVM_Scores_TS1_pred <- prediction(SVM_Scores_TS1_df$predictions, SVM_Scores_TS1_df$labels)
SVM_Scores_TS1_perf <- performance(SVM_Scores_TS1_pred,"tpr","fpr")
#plot(SVM_Scores_TS1_perf,col='red', main="TS1", lwd=3)
auc_ROCR_TS1 <- performance(SVM_Scores_TS1_pred, measure = "auc")
auc_ROCR_TS1 <- auc_ROCR_TS1@y.values[[1]]
SVM_Scores_TS2_df <- SVM_Scores_TS2 %>% mutate(predictions = case_when(SVM_Score <0 ~ 0, SVM_Score >=0 ~1)) %>% mutate(labels = case_when(illness == "TB" ~ 1, illness == "TB (12 mo)" ~0))
SVM_Scores_TS2_pred <- prediction(SVM_Scores_TS2_df$predictions, SVM_Scores_TS2_df$labels)
SVM_Scores_TS2_perf <- performance(SVM_Scores_TS2_pred,"tpr","fpr")
#plot(SVM_Scores_TS2_perf,col='red', main="TS2", lwd=3)
auc_ROCR_TS2 <- performance(SVM_Scores_TS2_pred, measure = "auc")
auc_ROCR_TS2 <- auc_ROCR_TS2@y.values[[1]]
SVM_Scores_TS3_df <- SVM_Scores_TS3 %>% mutate(predictions = case_when(SVM_Score <0 ~ 0, SVM_Score >=0 ~1)) %>% mutate(labels = case_when(illness == "Controls/Latent TB" ~ 1, illness == "TB" ~ 0))
SVM_Scores_TS3_pred <- prediction(SVM_Scores_TS3_df$predictions, SVM_Scores_TS3_df$labels)
SVM_Scores_TS3_perf <- performance(SVM_Scores_TS3_pred,"tpr","fpr")
#plot(SVM_Scores_TS3_perf,col='red', main = "TS3", lwd=3)
auc_ROCR_TS3 <- performance(SVM_Scores_TS3_pred, measure = "auc")
auc_ROCR_TS3 <- auc_ROCR_TS3@y.values[[1]]
SVM_Scores_TS4_df <- SVM_Scores_TS4 %>% mutate(predictions = case_when(SVM_Score <0 ~ 0, SVM_Score >=0 ~1)) %>% mutate(labels = case_when(illness == "TB" ~ 1, illness == "Controls/Latent TB" ~ 0))
SVM_Scores_TS4_pred <- prediction(SVM_Scores_TS4_df$predictions, SVM_Scores_TS4_df$labels)
SVM_Scores_TS4_perf <- performance(SVM_Scores_TS4_pred,"tpr","fpr")
#plot(SVM_Scores_TS4_perf,col='red', main = "TS4", lwd=3)
auc_ROCR_TS4 <- performance(SVM_Scores_TS4_pred, measure = "auc")
auc_ROCR_TS4 <- auc_ROCR_TS4@y.values[[1]]
SVM_Scores_TS5_df <- SVM_Scores_TS5 %>% mutate(predictions = case_when(SVM_Score <0 ~ 0, SVM_Score >=0 ~1)) %>% mutate(labels = case_when(illness == "TB" ~ 1, illness == "Latent TB" ~ 0))
SVM_Scores_TS5_pred <- prediction(SVM_Scores_TS5_df$predictions, SVM_Scores_TS5_df$labels)
SVM_Scores_TS5_perf <- performance(SVM_Scores_TS5_pred,"tpr","fpr")
#plot(SVM_Scores_TS5_perf,col='red', main = "TS5", lwd=3)
auc_ROCR_TS5 <- performance(SVM_Scores_TS5_pred, measure = "auc")
auc_ROCR_TS5 <- auc_ROCR_TS5@y.values[[1]]
SVM_Scores_TS6_df <- SVM_Scores_TS6 %>% mutate(predictions = case_when(SVM_Score <0 ~ 0, SVM_Score >=0 ~1)) %>% mutate(labels = case_when(illness == "TB" ~ 1, illness == "Latent TB" ~ 0))
SVM_Scores_TS6_pred <- prediction(SVM_Scores_TS6_df$predictions, SVM_Scores_TS6_df$labels)
SVM_Scores_TS6_perf <- performance(SVM_Scores_TS6_pred,"tpr","fpr")
#plot(SVM_Scores_TS6_perf,col='red', main = "TS6", lwd=3)
auc_ROCR_TS6 <- performance(SVM_Scores_TS6_pred, measure = "auc")
auc_ROCR_TS6 <- auc_ROCR_TS6@y.values[[1]]
#par(mfrow=c(2,4))
#(SVM_Scores_Origin_perf,col='red', main="Training Set", lwd=3)
#text(x=0.74, y=0.05, labels="AUC:1.000", col="black", cex=1)
#plot(SVM_Scores_TS1_perf,col='red', main="TS1", lwd=3)
#text(x=0.74, y=0.05, labels="AUC:0.875", col="black", cex=1)
#plot(SVM_Scores_TS2_perf,col='red', main="TS2", lwd=3)
#text(x=0.74, y=0.05, labels="AUC:0.500", col="black", cex=1)
#plot(SVM_Scores_TS3_perf,col='red', main = "TS3", lwd=3)
#text(x=0.74, y=0.05, labels="AUC:0.739", col="black", cex=1)
#plot(SVM_Scores_TS4_perf,col='red', main = "TS4", lwd=3)
#text(x=0.74, y=0.05, labels="AUC:0.305", col="black", cex=1)
#plot(SVM_Scores_TS5_perf,col='red', main = "TS5", lwd=3)
#text(x=0.26, y=0.95, labels="AUC:0.186", col="black", cex=1)
#plot(SVM_Scores_TS6_perf,col='red', main = "TS6", lwd=3)
#text(x=0.74, y=0.05, labels="AUC:0.500", col="black", cex=1)
figure_paper_roc <- include_graphics("Paper_ROC.png")
plot_roc <- include_graphics("plot_roc.png")
include_graphics(c(figure_paper_roc, plot_roc))
Figure 7. ROC plots and AUC values for the Training Set, Test Set 1, Test Set 2, Test Set 3, Test Set 4, Test Set 5, Test Set 6. Left: The Published Paper. Right: Replication Result.
confusion.glm_Origin = confusionMatrix(
data = as.factor(SVM_Scores_Origin_df$predictions),
reference = as.factor(SVM_Scores_Origin_df$labels)
)
confusion.glm_TS1 = confusionMatrix(
data = as.factor(SVM_Scores_TS1_df$predictions),
reference = as.factor(SVM_Scores_TS1_df$labels)
)
confusion.glm_TS2 = confusionMatrix(
data = as.factor(SVM_Scores_TS2_df$predictions),
reference = as.factor(SVM_Scores_TS2_df$labels)
)
confusion.glm_TS3 = confusionMatrix(
data = as.factor(SVM_Scores_TS3_df$predictions),
reference = as.factor(SVM_Scores_TS3_df$labels)
)
confusion.glm_TS4 = confusionMatrix(
data = as.factor(SVM_Scores_TS4_df$predictions),
reference = as.factor(SVM_Scores_TS4_df$labels)
)
confusion.glm_TS5 = confusionMatrix(
data = as.factor(SVM_Scores_TS5_df$predictions),
reference = as.factor(SVM_Scores_TS5_df$labels)
)
confusion.glm_TS6 = confusionMatrix(
data = as.factor(SVM_Scores_TS6_df$predictions),
reference = as.factor(SVM_Scores_TS6_df$labels)
)
Performance_classifier <- read.csv("performance_of_classifier.csv") %>% dplyr::rename(` ` = 1, ` `=2, ` `=4, ` `=6, ` `=8, ` `=10) %>% mutate(Sensitivity=case_when(Sensitivity=="90"~"90.0",T~Sensitivity))%>%mutate(` `=case_when(` `=="100"~"100.0",T~` `))%>%mutate(Specificity=case_when(Specificity=="100"~"100.0",Specificity=="97"~"97.0",T~Specificity)) %>%mutate(` `=case_when(` `=="100"~"100.0",` `=="75"~"75.0",` `=="0"~"0.0",` `=="5"~"5.0",T~` `))%>%mutate(` `=case_when(` `=="100"~"100.0",` `=="50"~"50.0",T~` `))%>%mutate(AUC=case_when(AUC=="1"~"1.000",T~AUC))%>%mutate(` `=case_when(` `=="1"~"1.000",` `=="0.5"~"0.500",T~` `))
kable(Performance_classifier, caption = "Table 4. Performance of the Classifier Measured through Sensitivity, Specificity, Accuracy and AUC for the Training Set(HIV/TB), Test Set 1, Test Set 2, Test Set 3, Test Set 4, Test Set 5 and Test Set 6")
| Sensitivity | Specificity | Accuracy | AUC | ||||||
|---|---|---|---|---|---|---|---|---|---|
| Dataset | Description | SVM | SVM(Reproduced) | SVM | SVM(Reproduced) | SVM | SVM(Reproduced) | SVM | SVM(Reproduced) |
| HIV/TB | HIV/TB vs. HIV | 76.2 | 100.0 | 86.4 | 100.0 | 81.4 | 100.0 | 0.864 | 1.000 |
| TS1 | TB (2 mo) vs. Controls | 85.7 | 100.0 | 100.0 | 75.0 | 94.7 | 75.0 | 1.000 | 0.875 |
| TS2 | TB vs. TB (12 mo) | 85.7 | 100.0 | 100.0 | 0.0 | 92.9 | 0.0 | 0.98 | 0.500 |
| TS3 | TB vs. Controls/Latent TB | 69.2 | 100.0 | 100.0 | 47.8 | 90.5 | 47.8 | 0.936 | 0.7391 |
| TS4 | TB vs. Controls/Latent TB | 76.2 | 51.5 | 97.0 | 9.5 | 88.9 | 9.5 | 0.893 | 0.305 |
| TS5 | TB vs. Latent TB | 90.0 | 32.3 | 90.3 | 5.0 | 90.2 | 5.0 | 0.924 | 0.186 |
| TS6 | TB vs. Latent TB | 86.2 | 100.0 | 92.1 | 0.0 | 89.6 | 0.0 | 0.967 | 0.500 |
Evaluation and analysis of the methods and results used to address the aim of finding a 251 gene signature for HIV/TB co-infection revealed that there exists uncertainties in regards to the accuracy and validity of the findings and methodology used in the published paper.
Attainment of the data used within the published paper revealed that normalized data had been provided as shown in Figure 1, therefore, without access to the raw data, the most suitable normalization method could not be determined.Thus the use of quantile normalization cannot be evaluated whether it is deemed appropriate or not.
Averaging the gene expression data of a patient where there existed duplication of their data was appropriate, however, it would be better to have the data collection procedure improved that would allow such duplication to not occur in the future.
Having a maximum fold change cut-off of 1.2 to determine the removal of genes was arbitrary with no explanation provided as to why 1.2 was chosen. Using this cut-off would mean that a slight upregulation or deregulation of a gene would lead to its removal. However, implementation of this step during replication revealed that no genes had a maximum fold change less than 1.2 causing skepticism as to whether this step would have resulted in the 2000 genes needing to be removed as conducted in the original paper. Rather, the use of the most variable genes, determined by the variance of its expression level, should be used as the cutoff instead.
Inconsistencies existed within the independent validation samples stemming from the external datasets. The only valid test set was Test Set 1 of which consisted of 7 patients with active TB and 12 healthy patients as the control. Test Set 2 consisted of all patients with active TB and Test Set 3,4,5 and 6 all had patients with latent TB as part of their control. With the aim being able to distinguish TB from non-TB, having no patients that do not have TB such as in Test Set 2 would deem this sample as inappropriate to validate the findings. Additionally, having latent TB patients as part of the control would be invalid upon determining the gene expression differences between TB and non-TB. Normalization of the external datasets of which involved adjusting for the differences in the test set’s mean against the overall mean is appropriate, but doesn’t allow for generalisability to other datasets. Any dataset that contains expression levels vastly different from the dataset the signature was detected from cannot be dealt with by using averages. This is a major issue for most genetic analyses and the best approach for dealing with such an issue was unable to be determined.
The top 20 genes found in the paper are displayed in Table 1 and the top 20 genes found upon replication are displayed in Table 2, where only 3 genes were similar as shown in Table 3. The 251 gene signature found in the paper could not be replicated where only 84 genes were found to be common. A poor outline of the working process of the classifier and no hyperparameters of the model were given. Additionally, no criteria was given on how they based their ranking of genes.
A heatmap based on the gene expression of the training set was created as shown in Figure 3. Both heatmaps in the paper and upon replication were able to cluster the samples into two groups, however whilst the scale suggests that it was fold changes, there was no specification on what the input values of the heatmap were. Principal component analysis on the training set and Test Set 1 resulted in the samples being similarly clustered with the same trend based on treatment found following the order of controls, 12 months of TB treatment, 2 months of TB treatment and TB itself.
A support vector machine classifier was built using the paper’s 251 gene signature. Inspection of the 6 test sets revealed that not all of the 251 gene signature could be found, therefore, the classifier was built on the 231 common genes instead. The majority of the samples could be classified as HIV/TB co-infected versus HIV only using the paper’s classifier however all samples were correctly classified as HIV/TB or HIV only using the replication’s classifier.
The SVM classifier was validated against all the test sets with the SVM scores for Test Set 1 plotted in Figure 5 and Test Set 2 plotted in Figure 6. No details in regards to what the error bars represent were given. Both the paper and the replication’s classifier could correctly classify the majority of patients as either TB-infected or not for Test Set 1. For Test Set 2, the paper’s classifier was able to distinguish between the two groups of active TB patients with no treatment and active TB patients on 12 months of treatment. However, the replication’s classifier was unable to classify these two groups separately. This could indicate that the replication’s classifier would be more appropriate in being able to classify TB from non-TB as all the samples were classified into the same group due to all the samples having active TB in Test Set 2.
ROC plots were created and AUC values were found for the training set and all the test sets as shown in Figure 7. The sensitivity, specificity, accuracy and AUC values were found and tabulated in Table 4. The low AUC and accuracy values from the replication’s classifier for Test Set 2,3,4,5 and 6 could be due to the fact that active TB patients existed within Test Set 2 and latent TB patients existed within Test Set 3,4,5 and 6. Due to training the classifier on a dataset with the difference being TB and non-TB infection, it would be expected that low AUC and accuracy values would be returned due to the fact that the majority or all of the samples had TB in Test Set 2,3,4,5 and 6. Rather, the high AUC values for the paper’s classifier would indicate that it does not successfully distinguish TB patients from non-TB patients.
Mycobacterium tuberculosis continues to be an important disease to detect to ensure the right treatment regime is given to affected patients. There existed great difficulty in reproducing the results found in the published paper titled ‘Identification of a 251 Gene Expression Signature That Can Accurately Detect M. tuberculosis in Patients with and without HIV Co-Infection’. The 251 gene expression signature could not be successfully found through SVM-RFE and the signature did not exist in all the test sets. Difficulty in replication could be partly due to a lack of information given in regards to methodology such as no details in regards to what certain features, such as error bars, represent, or how a ranking criteria was formed. However, the other reason is that there were clear errors in the study methodology and design. The use of test sets where the majority or all the samples had TB was incorrect. The accuracy and validity of finding a gene expression signature that can distinguish between TB and non-TB could be improved in the future by training and testing on multiple datasets consisting of both TB and non-TB patients where there would be a reduction in any hidden elements in HIV patients that could confound TB prediction.